PART ONE

Question

• CONTEXT: Medical research university X is conducting deep research on patients with certain conditions. The university has an internal AI team. For confidentiality, the patients' details and their conditions are masked by the client, who provides separate datasets to the AI team for developing an AI/ML model that can predict a patient's condition from the received test results.

• DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current conditions. Each patient is represented in the dataset by six biomechanics attributes derived from the shape and orientation of the condition relative to the affected body part.

    1. P_incidence
    2. P_tilt
    3. L_angle
    4. S_slope
    5. P_radius
    6. S_degree
    7. Class

1. Import and warehouse data:

• Import all the given datasets and explore shape and size of each.

• Merge all datasets onto one and explore final shape and size.

>> Import all the given datasets

>> Explore shape and size of each

Shape and size of individual files:

All the files have the same number of columns.
Normal data has 100 rows.
Type_H data has 60 rows.
Type_S data has 150 rows.

>> Check the first 5 rows with head()

>> Merge all datasets onto one

>> Explore final shape and size

Overall the merged data has 310 rows and 7 columns
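As a minimal sketch of the import-and-merge step (assuming pandas; the toy frames below stand in for the three files, which in the real notebook would come from `pd.read_csv` with the actual file names):

```python
import pandas as pd

# Toy stand-ins for the three files; in the real notebook these would be
# pd.read_csv("Normal.csv") etc. (file names here are illustrative).
normal = pd.DataFrame({"P_incidence": [63.0, 40.2], "Class": ["Normal"] * 2})
type_h = pd.DataFrame({"P_incidence": [55.1], "Class": ["Type_H"]})
type_s = pd.DataFrame({"P_incidence": [88.6, 74.4, 96.7], "Class": ["Type_S"] * 3})

for name, df in [("Normal", normal), ("Type_H", type_h), ("Type_S", type_s)]:
    print(name, df.shape)

# Stack row-wise; ignore_index renumbers rows 0..n-1 across the merged frame.
data = pd.concat([normal, type_h, type_s], ignore_index=True)
print(data.shape)  # (6, 2) for the toy frames; (310, 7) on the real files
```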

2. Data cleansing:

• Explore and if required correct the datatypes of each attribute

• Explore for null values in the attributes and if required drop or impute values.

>>Explore data type

All the columns except Class are float.

Since Class is a categorical variable, we can change its data type accordingly
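A minimal sketch of both cleansing checks (dtype fix and null scan) on a toy frame, assuming pandas:

```python
import pandas as pd

df = pd.DataFrame({"P_incidence": [63.0, 55.1, 88.6],
                   "Class": ["Normal", "Type_H", "Type_S"]})

# Class holds labels rather than measurements, so store it as a categorical dtype.
df["Class"] = df["Class"].astype("category")
print(df.dtypes["Class"])        # category

# Count missing values per column; all zeros means nothing to drop or impute.
print(df.isnull().sum())
```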

>> Explore for null values in the attributes

There are no null values present in the data

>> Get info on each category

>> Explore the categorical value and standardise the categories

Here the class names are not standardised, so we change Nrmal to Normal, type_h to Type_H and tp_s to Type_S.
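The standardisation step can be sketched with a simple replacement map (toy labels, assuming pandas):

```python
import pandas as pd

labels = pd.Series(["Nrmal", "Normal", "type_h", "tp_s", "Type_S"])

# Map each inconsistent spelling onto its canonical label.
fixes = {"Nrmal": "Normal", "type_h": "Type_H", "tp_s": "Type_S"}
labels = labels.replace(fixes)
print(sorted(labels.unique()))  # ['Normal', 'Type_H', 'Type_S']
```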

1. All the columns have the same data type except Class.
2. There are 310 samples.
3. Since all the features are numeric and complete, there is no need for pre-processing such as encoding or filling in missing values.
4. Since the value ranges of the features differ widely from each other, scaling should be done.

3. Data analysis & visualisation:

• Perform detailed statistical analysis on the data.

• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

Statistical Analysis

    1. P_incidence has 310 values, with a mean of 60.497, standard deviation of 17.23, minimum of 26.148 and maximum of 129.83. The median is around 58.69.

Mean and median are almost the same, so the data is close to normal with very little skewness.

    2. P_tilt has 310 values, with a mean of 17.54, standard deviation of 10, and minimum and maximum of -6.555 and 49.43 respectively. The median is around 16.358.

Mean and median differ very little, so the data is close to normal with very little skewness.

    3. L_angle has 310 values, with a mean of 51.93, standard deviation of 18.55, minimum of 14 and maximum of 125.74. The median is around 49.562.

Mean and median differ very little, so the data is close to normal with very little skewness.

    4. S_slope has 310 values, with a mean of 42.95, standard deviation of 13.423, and minimum and maximum of 13.36 and 121.43 respectively. The median is around 42.4.

Mean and median differ very little, so the data is close to normal with very little skewness.

    5. P_radius has 310 values, with a mean of 117.92, standard deviation of 13.317, minimum of 70.1 and maximum of 163.1. The median is around 118.27.

Mean and median differ very little, so the data is close to normal with very little skewness.

    6. S_Degree has 310 values, with a mean of 26.297, standard deviation of 37.55, and minimum and maximum of -11.06 and 418.54 respectively. The median is around 11.76.

Here mean and median differ substantially; the data is skewed with a few very high outliers (max: 418.54 vs 75th percentile: 41.29).
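The mean-vs-median reasoning above can be checked numerically; a sketch on synthetic columns (a symmetric one like P_incidence and a long-tailed one mimicking S_Degree; the distributions are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Roughly symmetric, like P_incidence.
    "P_incidence": rng.normal(60.5, 17.2, 310),
    # Long right tail, mimicking S_Degree's outliers.
    "S_Degree": rng.lognormal(2.5, 1.0, 310),
})
for col in df:
    print(f"{col}: mean={df[col].mean():.2f} "
          f"median={df[col].median():.2f} skew={df[col].skew():.2f}")
```

For the skewed column, mean sits well above median and the skew statistic is large and positive, matching the argument made for S_Degree.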

Univariate Analysis

There is one attribute in the dataset that contains discrete values: Class. The chart types we can use for a single discrete variable's distribution are a countplot (a Seaborn bar-style count chart) and a percentage distribution.

1. The figure above shows the count distribution of the Class categories
2. Here we can observe that 48.4% of the data is Type_S, 32.3% is Normal and 19.4% is Type_H
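The percentages follow directly from the class counts; a sketch assuming pandas (the `sns.countplot` call in the comment is the usual Seaborn counterpart):

```python
import pandas as pd

# Class counts from the merged data: 150 Type_S, 100 Normal, 60 Type_H.
classes = pd.Series(["Type_S"] * 150 + ["Normal"] * 100 + ["Type_H"] * 60)

pct = classes.value_counts(normalize=True).mul(100).round(1)
print(pct)  # Type_S 48.4, Normal 32.3, Type_H 19.4
# The count plot itself would be: sns.countplot(x=classes)
```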

Single Continuous Variable Distribution (Univariate Visualization)

The distribution of the P_incidence variable is close to the normal (Gaussian) distribution, with slightly more mass to the right of the mean. Most machine learning models do better on normally distributed data.

All features except S_Degree are approximately normally distributed; S_Degree is right-skewed.

The table above shows the distribution of all continuous variables.


Bivariate Analysis

P_incidence IQR is higher for the Type_S class; Type_H and Normal have overlapping P_incidence IQRs.

Mean P_incidence is highest for Type_S and lowest for Type_H.

There are a few outliers in the Type_S and Type_H classes.

The IQR of P_tilt overlaps for all 3 classes.

Mean P_tilt is highest for Type_S, followed by Type_H and Normal.

There are a few outliers for Normal and Type_H.

L_angle has quite a similar distribution across classes as P_incidence.

L_angle IQR is higher for the Type_S class; Type_H and Normal have overlapping L_angle IQRs.

Mean L_angle is highest for Type_S and lowest for Type_H.

There are a few outliers in the Type_S, Normal and Type_H classes.

S_slope has quite a similar distribution across classes as P_incidence and L_angle.

S_slope IQR is higher for the Type_S class; Type_H and Normal have overlapping S_slope IQRs.

Mean S_slope is highest for Type_S and lowest for Type_H.

There are a few outliers in the Type_S and Normal classes.

Mean P_radius is highest for Normal, followed by Type_H and Type_S.

There are a few outliers for all the classes.

There are many large outliers in S_Degree for the Type_S class.

Mean S_Degree for Normal and Type_H is much lower than for Type_S, whose mean is inflated by outliers.

Showing the Relationship of the Features with Each Other

We can observe the relationship between all variables with pairplot and correlation matrix.

Pairplot shows the relationships of all variables with each other using scatter plots. If there is a linear relationship between two variables, removing one of them is likely to have a positive effect on the performance of the machine learning model.

The correlation matrix determines the direction and strength of the relationship between variables. It presents the relationships shown by the pairplot with clearer, numerical values. The correlation matrix is the best chart type for interpreting the dataset as a whole.

There is high correlation between P_incidence and L_angle, and between P_incidence and S_slope. The correlation between L_angle and S_slope is 0.6, which is moderate.
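A sketch of how such a correlation matrix is produced, on synthetic columns engineered to mimic the correlations described (the coefficients and noise levels are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
p_incidence = rng.normal(60, 17, 310)

# L_angle and S_slope built as noisy linear functions of P_incidence,
# mimicking the strong correlations observed in the real data.
df = pd.DataFrame({
    "P_incidence": p_incidence,
    "L_angle": 0.8 * p_incidence + rng.normal(0, 8, 310),
    "S_slope": 0.7 * p_incidence + rng.normal(0, 6, 310),
    "P_radius": rng.normal(118, 13, 310),   # unrelated column
})
corr = df.corr().round(2)
print(corr["P_incidence"])
# sns.heatmap(corr, annot=True) renders the same matrix as a heat map.
```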

Class Distributions and Distribution of Feature Values Over Classes

Since 'Class' is a discrete variable, plot types for categorical data are preferred. The plot types used for categorical values in the Seaborn library are: stripplot(), swarmplot(), boxplot(), violinplot(), boxenplot(), pointplot(), barplot(), countplot().

First, let's use countplot() to see the class distributions. Next, let's show the relationship between the output variable and each input variable with swarmplot()

Multivariate Analysis

Helper Function: draw_multivarient_plot

To interpret the relations of the features in the dataset with the class, the helper draws:
1. Violin plot
2. Box plot
3. Point plot
4. Bar plot

Here we can see the distribution of the different variables against Class (the categorical variable)

Except for S_Degree, the distributions are fairly normal.

Here we can see the mean of each feature for the different class types

With respect to mean :

P_incidence: Type_S has the maximum P_incidence and Type_H the minimum.

P_tilt: Type_S again has the maximum P_tilt and Normal the minimum.

L_angle: Type_S has the maximum L_angle and Type_H the minimum.

S_slope: Type_S has the maximum S_slope and Type_H the minimum.

P_radius: Normal has the maximum P_radius and Type_S the minimum.

S_Degree: Normal and Type_H have almost equal S_Degree means, but Type_S has the highest mean.

There is a linear relation between P_incidence and L_angle, and between P_incidence and S_slope.

There is moderate linearity between L_angle and S_slope.

We can compare the results and remove highly correlated features after hypothesis testing.

Hypothesis Testing

We can check whether all the independent variables have a significant effect on the target variable

There is a huge difference in S_Degree for the Type_S class

Hypothesis Testing of continuous feature with target variable

Here we will be using a two-sample unpaired t-test

Ho (Null Hypothesis): There is no significant difference in the independent feature across categories of the target variable

H1 (Alternate Hypothesis): There is a significant difference in the independent feature across categories of the target variable
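One pairwise test can be sketched as follows (toy samples with different class means, assuming scipy; Welch's variant is used here so equal variances need not be assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Toy P_incidence samples for two classes with clearly different means.
normal_grp = rng.normal(55, 15, 100)
type_s_grp = rng.normal(72, 15, 150)

# Two-sample unpaired t-test; equal_var=False (Welch) avoids assuming
# the two classes share the same variance.
t_stat, p_value = stats.ttest_ind(normal_grp, type_s_grp, equal_var=False)
print(f"t={t_stat:.2f} p={p_value:.4f}")
if p_value < 0.05:
    print("Reject Ho: the class means differ significantly")
```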

Hypothesis Conclusions:

We can see S_Degree has no significant effect in distinguishing the Type_H and Normal categories.

Also, P_radius has no significant effect in distinguishing the Type_H and Type_S categories.

After hypothesis testing, it is clear that every continuous feature has influence in determining the Class for at least one pair of categories. Hence we are keeping all the variables.

Though there is collinearity between some of the variables statistically, hypothesis testing revealed that overall all the variables help in deciding the class category. Hence it is not ideal to drop any variables.

4. Data pre-processing:

• Segregate predictors vs target attributes

• Perform normalisation or scaling if required.

• Check for target balancing.

• Perform train-test split.

• Segregate predictors vs target attributes

1. x: features
2. y: target variables (normal,type_h,type_s)

• Perform normalisation or scaling if required.

Convert the features into z-scores, since we do not know what units/scales were used, and store them in a new dataframe.

It is always advised to scale numeric attributes for models that calculate distances.
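The z-score conversion can be sketched with scikit-learn's `StandardScaler` (toy values; column names match the report's features):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"P_incidence": [26.1, 60.5, 129.8],
                   "P_radius": [70.1, 117.9, 163.1]})

# z = (x - mean) / std, column by column; in a real pipeline, fit the scaler
# on the training split only to avoid leakage into the test set.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.round(2))  # each column now has mean 0 and unit variance
```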

Checking Outliers

We can remove the outliers for P_tilt, P_radius and S_Degree

• Check for target balancing and fix it if found imbalanced.

Here the data is in the ratio Normal:Type_H:Type_S = 10:6:15

If the imbalanced data is not treated beforehand, then this will degrade the performance of the classifier model. Most of the predictions will correspond to the majority class and treat the minority class features as noise in the data and ignore them. This will result in a high bias in the model.

Since the data size is small, we will oversample the data
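One simple oversampling scheme is random resampling with replacement via scikit-learn (SMOTE from the imbalanced-learn package is a common alternative; the toy frame below mirrors the 10:6:15 class ratio):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(31),
                   "Class": ["Normal"] * 10 + ["Type_H"] * 6 + ["Type_S"] * 15})

# Upsample every class to the majority-class count (here 15) by
# drawing rows with replacement.
target = df["Class"].value_counts().max()
parts = [resample(group, replace=True, n_samples=target, random_state=42)
         for _, group in df.groupby("Class")]
balanced = pd.concat(parts, ignore_index=True)
print(balanced["Class"].value_counts())  # 15 rows per class
```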

• Perform train-test split.

Now we have data ready for training and testing
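The split itself can be sketched with scikit-learn (toy frame; the real 310 rows split 70/30 into 217 train / 93 test the same way):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(20),
                   "Class": ["Normal"] * 8 + ["Type_H"] * 4 + ["Type_S"] * 8})
X, y = df[["x"]], df["Class"]

# 70/30 split, stratified so both splits keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
print(X_train.shape, X_test.shape)  # (14, 1) (6, 1)
```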

ACTIONS PERFORMED:

    1. Segregate Data(feature and target)
    2. Scale data
    3. Remove outliers
    4. Sampled Data 
    5. Train-test split


Final Data:

    1. Training data without sampling:
        > x_train : 217 rows (scaled, outliers treated)
        > y_train : 217 corresponding labels
    2. Testing data without sampling:
        > x_test : 93 rows (scaled, outliers treated)
        > y_test : 93 corresponding labels
    3. Training data with sampling:
        > x_train_res : 315 rows (scaled, outliers treated, oversampled)
        > y_train_res : 315 corresponding labels
    4. Testing data with sampling:
        > x_test_res : 135 rows (scaled, outliers treated, oversampled)
        > y_test_res : 135 corresponding labels


   Total non-sampled dataset : 310
   Total sampled dataset : 450

5. Model training, testing and tuning:

• Design and train a KNN classifier.
• Display the classification accuracies for train and test data.
• Display and explain the classification report in detail.
• Automate the task of finding best values of K for KNN.
• Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.
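A sketch of automating the K search (synthetic 3-class data stands in for the biomechanics features; the K range is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class stand-in for the six biomechanics features.
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Fit a classifier for each odd K and keep the best test accuracy.
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
          for k in range(1, 30, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Odd K values avoid ties in majority voting; plotting `scores` against K gives the usual elbow-style curve for picking K.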

Inferences of KNN without sampling:

Accuracy

Testing accuracy here remains poor at 0.731, though training accuracy is high

Confusion Matrix :

1. We can see that Type_S has **high** precision and recall, which is very good: the model detects positive Type_S cases well.
In the matrix, only 4 cases were wrongly predicted as Type_S (Normal predicted as Type_S),
and 5 of the 48 Type_S cases were predicted wrongly, as Type_H (1) and Normal (4).
Hence the f1 score (harmonic mean of precision and recall) is also good.

2. Normal and Type_H have poor precision, recall and f1 scores. The same is reflected in the matrix.

3. The overall macro average (average over each category) is poor.

4. The overall weighted average (average weighted by support) is higher due to the large contribution from Type_S
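The report's terms (precision, recall, f1, macro vs weighted averages) can be reproduced on toy predictions resembling the matrix described above (the labels below are illustrative, not the model's actual output):

```python
from sklearn.metrics import classification_report, confusion_matrix

# 93 toy test labels: Type_S is mostly right, Normal/Type_H are weaker.
y_true = ["Type_S"] * 45 + ["Normal"] * 30 + ["Type_H"] * 18
y_pred = (["Type_S"] * 40 + ["Normal"] * 5      # true Type_S
          + ["Normal"] * 22 + ["Type_H"] * 8    # true Normal
          + ["Type_H"] * 12 + ["Normal"] * 6)   # true Type_H

print(confusion_matrix(y_true, y_pred, labels=["Normal", "Type_H", "Type_S"]))
print(classification_report(y_true, y_pred))
```

The macro row averages the three per-class scores equally, while the weighted row weights them by support, which is why a dominant Type_S inflates the weighted figures.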

Inferences of KNN with sampling:

Accuracy

Testing accuracy here is better at 0.844.

Confusion Matrix :

1. Here Type_S has the best precision (1.0), but recall has dipped. Overall the f1 score remains similar to the model without sampling.
This model has no false positives for Type_S: in the matrix, 0 Normal and 0 Type_H cases were predicted as Type_S.
However, it predicted a few Type_S cases as Normal (6) and Type_H (2), which affected its recall.

2. Normal and Type_H have better precision, recall and f1 scores. The same is reflected in the matrix.

3. The overall macro average (average over each category) has improved compared to without sampling, which indicates better predictions across all categories.

4. The overall weighted average has improved compared to without sampling and is now close to the macro average.

NOTE : Since this is a multi-class problem, we are skipping the ROC AUC curve, as it becomes complex and inconclusive overall

Since K=1 would overfit the data, we will use K=3 for the oversampled data. We have already evaluated K=3; we can also check K=2, though it would overfit.

We can see that all the scores have improved overall

Accuracy is now 85.2%, which is better

There are hyperparameters that need to be tuned.

For example:
    K in KNN
    linear regression parameters (coefficients)

Hyperparameter tuning:
    try all combinations of the different parameter values
    fit a model for each combination
    measure prediction performance
    see how well each performs
    finally choose the best hyperparameters
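The steps above are exactly what scikit-learn's `GridSearchCV` automates; a sketch on synthetic data (the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Try every (K, weighting) combination, scoring each by 5-fold CV accuracy,
# then report the best combination found.
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 16)), "weights": ["uniform", "distance"]},
    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```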

Naive Bayes

Inferences of Naive Bayes without sampling:

Accuracy

Testing accuracy here remains poor at 0.785, though training accuracy is high.

Confusion Matrix :

1. We can see that Type_S has **high** precision and recall, which is very good: the model detects positive Type_S cases well.
In the matrix, only 4 cases were wrongly predicted as Type_S (Normal predicted as Type_S),
and 3 of the 48 Type_S cases were predicted wrongly, as Normal (3).
Hence the f1 score (harmonic mean of precision and recall) is also good.

2. Normal and Type_H have poor precision and f1 scores. Recall is better for Type_H but poor for Normal. The same is reflected in the matrix.

3. The overall macro average (average over each category) is poor.

4. The overall weighted average (average weighted by support) is higher due to the large contribution from Type_S

Inferences of Naive Bayes with sampling:

Accuracy

Testing accuracy here is better at 0.815.

Confusion Matrix :

1. Here Type_S has precision, recall and f1 scores similar to the non-sampled data.

2. Normal and Type_H have better precision, recall and f1 scores. The same is reflected in the matrix.

3. The overall macro average has improved compared to without sampling, indicating better predictions across all categories.

4. The overall weighted average has improved compared to without sampling and is now close to the macro average.

Logistic Regression

Inferences of Logistic Regression without sampling:

Accuracy

Testing accuracy is good at 0.839.

Confusion Matrix :

1. Here recall, precision and f1 are better for all categories compared to the previous models without sampling. Type_S retains a very good score.

2. The overall macro average is better.

3. The overall weighted average is higher due to the large contribution from Type_S

Inferences of Logistic Regression with sampling:

Accuracy

Testing accuracy here has decreased to 0.822 compared to without sampling.

Confusion Matrix :

1. Precision, recall and f1 have improved here for Type_S and Normal.

2. Type_S has better precision, though recall has decreased.

3. The overall macro average has improved compared to without sampling, indicating better predictions across all categories.

4. The overall weighted average has decreased compared to without sampling and is now close to the macro average.

Support Vector Machine

Inferences of SVM without sampling:

Accuracy

Testing accuracy is good at 0.860.

Confusion Matrix :

1. Here recall, precision and f1 are better for all categories. Type_S retains a very good score.

2. The overall macro average is better.

3. The overall weighted average is higher.

Inferences of SVM with sampling:

Accuracy

Testing accuracy here has decreased to 0.837 compared to without sampling.

Confusion Matrix :

1. Precision, recall and f1 have improved here for all categories.

2. The overall macro average has improved compared to without sampling, indicating better predictions across all categories.

3. The overall weighted average has decreased compared to without sampling and is now close to the macro average.

6. Conclusion and improvisation:

• Write your conclusion on the results.

• Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the research team, to perform better data analysis in future.

Conclusion:

For individual model inferences are presented above for each model.

Overall conclusion is below:

Reference:

Precision: When it predicts the positive result, how often is it correct? i.e. limit the number of false positives.

Recall: When it is actually the positive result, how often does it predict correctly? i.e. limit the number of false negatives.

f1-score: Harmonic mean of precision and recall.

| Model | precision (macro) | recall (macro) | f1-score (macro) | precision (weighted) | recall (weighted) | f1-score (weighted) | Accuracy | Remark |
|---|---|---|---|---|---|---|---|---|
| KNN without sampling, without optimisation | 0.66 | 0.66 | 0.66 | 0.73 | 0.73 | 0.73 | 73.11 | Good with Type_S prediction |
| KNN with sampling, without optimisation | 0.85 | 0.84 | 0.84 | 0.86 | 0.84 | 0.85 | 84.44 | Improved accuracy compared to without sampling |
| KNN without sampling, with optimisation (k=22) | 0.79 | 0.78 | 0.79 | 0.82 | 0.82 | 0.82 | 81.72 | Better result than without optimisation |
| KNN with sampling, with optimisation (k=2) | 0.87 | 0.85 | 0.85 | 0.88 | 0.85 | 0.85 | 85.19 | Better result than without optimisation |
| Naive Bayes without sampling | 0.73 | 0.75 | 0.73 | 0.79 | 0.78 | 0.78 | 78.5 | |
| Naive Bayes with sampling | 0.81 | 0.81 | 0.8 | 0.82 | 0.81 | 0.81 | 81.48 | Better result for Type_H |
| Logistic regression without sampling | 0.79 | 0.8 | 0.79 | 0.84 | 0.84 | 0.84 | 83.87 | |
| Logistic regression with sampling | 0.82 | 0.82 | 0.82 | 0.83 | 0.82 | 0.83 | 82.22 | Uniform recall and precision across the types |
| SVM with optimisation, without sampling | 0.83 | 0.84 | 0.83 | 0.87 | 0.87 | 0.87 | 87.1 | Good accuracy and recall |
| SVM with optimisation, with sampling | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 87.4 | Best result in terms of the parameters shown |

Hence we can say that, in terms of overall accuracy and confusion matrix parameters, SVM has shown the best results.

But depending on the requirement and the importance of each category, we can select the best suited model.

For example:

If we are more interested in identifying the Type_S category, we can go for the model with the best f1 score for Type_S, i.e. SVM with optimisation.

Similarly, if we are interested in Type_H, we can go for SVM with optimisation (f1 of 0.85 for Type_H).

Similarly, if we are interested in Normal, we can go for KNN with optimisation (f1 of 0.84 for Normal).

Also, if we are more interested in reducing false positive results, we can look for the highest precision for that category.

And if we are more interested in reducing false negative results, we can look for the highest recall for that category.

Hence, depending on our requirement, area of focus (false positives or false negatives) and importance of category (Normal, Type_S, Type_H), we can select the best suited model.

• Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the research team, to perform better data analysis in future.

  1. The size of the data is small (310 samples).
  2. The data is not balanced. Balanced data gives each class equal weightage; if the data is unbalanced, the model is influenced more by one category and treats the others as noise. Though we applied oversampling/undersampling, this does not reflect real-life results.
  3. S_Degree is highly skewed.
  4. There were outliers.
  5. There is collinearity between P_incidence and L_angle, P_incidence and S_slope, and others.
  6. Type_S data is present in the largest numbers.
  7. Type_H data is the smallest class.
  8. The data is not scaled and is missing units, which limits proper conclusions.

========================================================================================================================

END OF PART ONE

========================================================================================================================

PART TWO

QUESTION:

• DOMAIN: Banking and finance
• CONTEXT: A bank X is on a massive digital transformation across all its departments. The bank has a growing customer base where the majority are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in expanding the borrower base rapidly to bring in more business via loan interest. A campaign that the bank ran last quarter showed an average single-digit conversion rate. With digital transformation being the core strength of the business strategy, the marketing department wants to devise effective campaigns with better target marketing to increase the conversion ratio to double digits with the same budget as the last campaign.
• DATA DESCRIPTION: The data consists of the following attributes:
    1. ID: Customer ID
    2. Age Customer’s approximate age.
    3. CustomerSince: Customer of the bank since. [unit is masked]
    4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
    5. ZipCode: Customer’s zip code.
    6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
    7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
    8. Level: A level associated to the customer which is masked by the bank as an IP.
    9. Mortgage: Customer’s mortgage. [unit is masked]
    10. Security: Customer’s security asset with the bank. [unit is masked]
    11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
    12. InternetBanking: if the customer uses internet banking.
    13. CreditCard: if the customer uses bank’s credit card.
    14. LoanOnCard: if the customer has a loan on credit card.

1. Import and warehouse data:

• Import all the given datasets and explore shape and size of each.

• Merge all datasets onto one and explore final shape and size.

Importing Data

Data1 has 5000 rows and 8 columns

Data2 has 5000 rows and 7 columns

Merge all datasets onto one and explore final shape and size.

Now we have 5000 rows and 14 columns
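The merge can be sketched with `pd.merge` on the shared customer ID (toy frames; the real files hold 5000 rows each):

```python
import pandas as pd

# Toy stand-ins: the two real files share the customer ID column.
data1 = pd.DataFrame({"ID": [1, 2, 3], "Age": [34, 45, 29]})
data2 = pd.DataFrame({"ID": [1, 2, 3], "LoanOnCard": [0, 1, 0]})

# Inner join on the shared key keeps a single ID column, which is how
# 8 + 7 columns merge into 14 on the real files.
merged = pd.merge(data1, data2, on="ID")
print(merged.shape)  # (3, 3)
```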

2. Data cleansing:

• Explore and if required correct the datatypes of each attribute

• Explore for null values in the attributes and if required drop or impute values.

The data types are int and float. We need to change the data type of the categorical variables.

The data is changed to categorical type where required.

We can see that there are null values in LoanOnCard.

Since we have only 20 null values out of 5000, we can drop the rows with null values.

All the null values are now dropped
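The null scan and drop can be sketched as follows (toy frame with two missing LoanOnCard values, assuming pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": range(6),
                   "LoanOnCard": [0, 1, np.nan, 0, np.nan, 1]})
print(int(df["LoanOnCard"].isnull().sum()))  # 2 missing values

# With only a handful of nulls out of thousands, dropping the rows is cheap.
df = df.dropna(subset=["LoanOnCard"])
print(df.shape)  # (4, 2)
```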

3. Data analysis & visualisation:

• Perform detailed statistical analysis on the data.

• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

* Age : Mean and median are almost the same, so we can say the data is close to normal with little or no skewness.
* CustomerSince : Mean and median are almost the same, so we can say the data is close to normal with little or no skewness.
* HighestSpend : Mean > median, so positive skewness exists.
* MonthlyAverageSpend : Mean > median, so positive skewness exists.
* Mortgage : There is high fluctuation in the Mortgage column. 50% of the data has zero values but the maximum value is 635. This column is heavily affected by outliers.

Univariate Analysis

No outliers are present here.

The data is normally distributed and wider in the middle.

There are more people aged between 35 and 65.

No outliers are present here.

The data is normally distributed and wider in the middle.

Most of the people have been customers of the bank for between 10 and 30 [units masked].

Here 96 outliers are present.

The data is positively skewed.

The highest spend in one transaction is mostly between 45 and 100; fewer customers spend more than approximately 200.

There are 324 outliers.

Huge positive skewness is present.

We can say that a few customers spend hugely each month with respect to the others.

Most of the zip codes appear only once.

There is no proper distribution.

This column does not add value to the model; we will drop it during model building.

Most of the values are 0.

There are huge outliers and the distribution is not normal.

We can say that most of the customers don't have mortgages.

Hidden score is almost equally distributed

Level 2 and 3 are almost equally distributed

Level 1 is slightly higher

89.6% of customers don't have a security asset with the bank.

93.9% of customers don't have a fixed deposit.

Customers using internet banking are only slightly more numerous than those who are not.

Only 29.4% of customers use the bank's credit card.

90% of customers do not have a loan on their credit card.

Bi Variate Analysis

CustomerSince is equally distributed between loan holders and non-loan holders.

Mean values are nearly equal for loan holders and non-loan holders.

1. People without a loan have a lower mean highest spend than people with a loan (the box for the loan category sits above the box for the non-loan category).

2. People without a loan sometimes have a higher highest spend than people with a loan (as seen from the outliers in the non-loan category).

3. The mean highest spend is higher for people with a loan.

1. People without a loan have a lower mean monthly average spend than people with a loan (the box for the loan category sits above the box for the non-loan category).

2. People without a loan sometimes (outliers) have a higher monthly average spend than people with a loan.

3. The mean monthly average spend is higher for people with a loan.

4. Graphically, the behaviour of monthly average spend against loan on card is quite similar to that of highest spend.

1. There are outlier mortgages for both people with and without a loan.

2. Mortgages are higher for people with a loan.

Age distribution is nearly equal for loan holders and non-loan holders.

The mean age for loan holders and non-loan holders is similar.

1. We can see a linear relationship between MonthlyAverageSpend and HighestSpend.

2. CustomerSince and Age are highly correlated.

Correlation among pairs of continuous variables

Age and CustomerSince have a correlation of 1. Either one can be used for the model.

HighestSpend has a high correlation with MonthlyAverageSpend.

Mortgage and HighestSpend have little correlation.

The rest of the variables do not have any notable relation.

Hidden score counts are higher for customers without a loan.

Level counts are higher for customers without a loan.

Security counts are higher for customers without a loan.

People without a fixed deposit are more numerous, and mostly without a loan.

Multivariate Analysis

We can clearly see that loan holders spend more money monthly, particularly at levels 2 and 3.

Internet banking doesn't affect monthly spend, as the line is flat for both loan and non-loan customers.

Monthly average spend is slightly higher for credit card holders among people with a loan.

Hypothesis Testing

We can check whether all the independent variables have a significant effect on the target variable

HighestSpend mean values differ hugely between loan and non-loan customers

The mean Age is equal for both classes

Hypothesis Testing of continuous feature with target variable

Here we will be using two-sample unpaired t-test

Ho(Null Hypothesis):There is no significant difference in independent feature with different category of Target variable

H1(Alternate Hypothesis):There is significant difference in independent feature with different category of Target variable

We can see Age, CustomerSince and ZipCode do not have an effect on the target variable, so we drop these columns before building the model.

Statistical Testing of categorical features with target variable

Ho: There is no significant difference in hidden score for different category of target variable(Loan on card)

H1: There is significant difference in hidden score for different category of target variable(Loan on card)
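A chi-square test of independence is the usual choice for a categorical feature against a categorical target; a sketch on a toy 2x2 contingency table (the counts are illustrative, not the real data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: CreditCard no/yes; columns: LoanOnCard no/yes (toy counts chosen so
# the loan rate is nearly identical in both rows).
table = np.array([[3200, 330],
                  [1320, 130]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f} p={p_value:.4f}")
# A large p means we fail to reject Ho: no association with the target.
```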

We can see CreditCard, InternetBanking and Security do not show a significant difference across the target variable, so we drop these columns before building the model.

Monthly average spend is slightly higher for FD account holders in both loan categories.

4. Data pre-processing:

• Segregate predictors vs target attributes

• Check for target balancing and fix it if found imbalanced.

• Perform train-test split.

Segregate predictors vs target attributes

• Check for target balancing and fix it if found imbalanced.

As we can see in the graph, there is a huge imbalance in the target variable.

If the imbalanced data is not treated beforehand, then this will degrade the performance of the classifier model. Most of the predictions will correspond to the majority class and treat the minority class features as noise in the data and ignore them. This will result in a high bias in the model.


• Perform train-test split.

Non-Sampled Data split

Sampled Data split

5. Model training, testing and tuning:

• Design and train a Logistic regression and Naive Bayes classifiers.

• Display the classification accuracies for train and test data.

• Display and explain the classification report in detail.

• Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model.

Inference on Logistic regression before sampling

Before Sampling

95% accuracy on training set and 94% accuracy on test set.

Here training and testing accuracy are balanced when the model is built without sampling, and the accuracy is good.

In the above confusion matrix, 59 and 27 are the errors of the model.

Here you can see the model is poor at predicting class 1 compared to class 0.

Accuracy is good, but in this case we need to look at the recall value.

Here recall tells us that only 60% of class 1 cases are predicted correctly.

We don't have enough samples of class 1 to train the model.

We will do the sampling and check how the recall value improves.

Inference on Logistic regression after sampling

After Sampling :

Both training and test accuracy are reduced after sampling. Let us check the classification report.

We can now see that the recall value has improved after sampling.

Since the target was imbalanced, we used a sampling method to balance the data.

The weighted f1 score has decreased now that the data is more balanced; with the imbalanced data, class 1 was effectively treated as noise.

The macro average and weighted average are now similar, which reflects balanced data.
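The macro/weighted comparison above can be made concrete: macro averaging weights every class equally, while weighted averaging weights each class by its support, so the two diverge under imbalance. The labels below are made up for illustration.

```python
from sklearn.metrics import precision_score

# 90 actual negatives (85 predicted right), 10 actual positives (8 right)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2

macro    = precision_score(y_true, y_pred, average="macro")
weighted = precision_score(y_true, y_pred, average="weighted")
print(round(macro, 3), round(weighted, 3))  # weighted > macro here,
# because the majority class (with the higher precision) dominates it
```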

Naive Bayes

Inference on NaiveBayes before sampling

Accuracy on the test data is slightly lower than on the training data.

The recall value is poor for class 1.

Recall for class 1 is lower in the Naive Bayes model than in logistic regression.

Inference on Naive Bayes after sampling

Accuracy on the test data is slightly lower than on the training data.

The recall value is good for both classes.

Recall for class 1 is still lower in the Naive Bayes model than in logistic regression.

Precision and f1 score are also lower than in logistic regression.

Inference on KNN

Accuracy on the test data is slightly lower than on the training data. This is the best accuracy among the models so far.

The recall value is good for both classes.

Recall is higher than in the previous models.

Precision and f1 score are also higher than in the previous models.

Since K=1 would overfit the model, we take the next best K, i.e. 5 for the non-sampled data.

For the sampled data, K=1 would likewise overfit, so we take the next best K, i.e. 3.
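One way to perform the K selection described above: score a range of odd K values and, because K=1 tends to memorise the training set, pick the best K greater than 1. The data here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y)

scores = {}
for k in range(1, 16, 2):            # odd k avoids ties in binary voting
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

# Skip k=1 (prone to overfitting) and take the best remaining k
best_k = max((k for k in scores if k > 1), key=lambda k: scores[k])
print(best_k, round(scores[best_k], 3))
```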

Inference on SVM

Accuracy on the test data is slightly lower than on the training data. This is the best accuracy among the models so far.

The recall value is good for both classes.

Recall is similar to KNN.

Precision and f1 score are also similar to KNN.

Inference on SVM after optimization

We have the best accuracy among the models so far, i.e. 95.74%.

Recall (0.96) is the best so far.

Precision (0.96) and f1 score (0.96) are the best so far.
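One common way the SVM could have been optimised: a small cross-validated grid search over C and gamma. The grid values here are illustrative assumptions, not the ones actually used; the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=3)
X = StandardScaler().fit_transform(X)     # SVMs are scale-sensitive
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3, stratify=y)

# 5-fold cross-validated search over a small illustrative grid
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```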

6. Conclusion and improvisation:

• Write your conclusion on the results.
• Detailed suggestions for improvements on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the bank to perform a better data analysis in future.

Conclusion:

Reference:

Precision: When it predicts the positive result, how often is it correct? i.e. limit the number of false positives.

Recall: When it is actually the positive result, how often does it predict correctly? i.e. limit the number of false negatives.

f1-score: Harmonic mean of precision and recall.
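The three definitions above, computed directly from confusion-matrix counts (the counts here are made up for illustration):

```python
# tp = true positives, fp = false positives, fn = false negatives
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)          # correct among predicted positives
recall    = tp / (tp + fn)          # correct among actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.67 0.73
```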

| Model | Precision (macro) | Recall (macro) | f1-score (macro) | Precision (weighted) | Recall (weighted) | f1-score (weighted) | Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic regression without sampling | 0.86 | 0.79 | 0.82 | 0.94 | 0.94 | 0.94 | 94.24 |
| Logistic regression with sampling | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 88.18 |
| Naive Bayes without sampling | 0.72 | 0.72 | 0.72 | 0.90 | 0.90 | 0.90 | 90.22 |
| Naive Bayes with sampling | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 86.11 |
| KNN without sampling, with optimisation (k=5) | 0.85 | 0.74 | 0.78 | 0.93 | 0.94 | 0.93 | 93.78 |
| KNN with sampling, with optimisation (k=3) | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 94.59 |
| SVM with sampling, without optimisation | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 94.37 |
| SVM with sampling, with optimisation | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 95.74 |


Final conclusion:

In terms of overall accuracy and confusion-matrix metrics, logistic regression, KNN and SVM have all shown fairly good results.

Logistic regression is not affected by overfitting and also has a good recall value.

Sampling improved prediction of the minority class as well.

We suggest collecting data equally for both classes.

A few customers do not have a credit card yet hold a loan on card; this data error can be avoided.

We prefer logistic regression (on balanced data) or GaussianNB as the model.

Overfitting is lower in logistic regression.

Precision and recall values are better at predicting the potential customers.

The banking domain prefers to look at precision over recall, so as to limit false positives (customers predicted to take the loan who would not).


Missing data is present.

There is high collinearity between similar columns.

The profiling report also warns about this high collinearity.

CustomerSince has 66 (1.3%) zeros.

MonthlyAverageSpend has 106 (2.1%) zeros, which is not realistic.

===========================================================================================

END